Optimization apparatus, optimization method, and non-transitory computer-readable medium in which optimization program is stored

ABSTRACT

An optimization apparatus (100) includes a setting unit (110) that sets a predetermined non-linear objective function, a policy determination unit (120) that determines a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function, a policy execution unit (130) that acquires a reward as an execution result of the determined policy, an update rate determination unit (140) that determines an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function, and an update unit (150) that updates the non-linear objective function, based on the update rate.

TECHNICAL FIELD

The present invention relates to an optimization apparatus, an optimization method, and an optimization program, and more particularly, to an optimization apparatus, an optimization method, and an optimization program that perform online optimization in a bandit problem.

BACKGROUND ART

Online optimization techniques are known in the field of decision making, such as policy determination in marketing. In online optimization, an optimal policy is determined under a setting in which a value of an objective function is acquired each time a certain policy is executed. In practice, however, there are cases in which values of the objective function are acquired only partially (a bandit problem). Specifically, when a certain policy A is executed, a value of the objective function associated with the policy A (e.g., a reward acquired by executing the policy A) can be acquired. However, the value of the objective function that would be acquired by executing a policy B remains unknown. Techniques of online optimization have therefore been proposed for the case where values of the objective function can be acquired only partially and the objective function is linear. Furthermore, Non Patent Literature 1 discloses a technique related to online optimization of a policy for a non-linear function.

Non Patent Literature 2 discloses a theory of linear and integer programming. Non Patent Literature 3 discloses a technique related to bandit convex optimization. Non Patent Literature 4 discloses a technique related to the geometry of logarithmic concave functions and sampling algorithms. Non Patent Literature 5 discloses a statistical study on logarithmic concavities and strong logarithmic concavities. Non Patent Literature 6 discloses a technique related to a multiplicative weights update method. Non Patent Literature 7 discloses a technique related to optimization of an approximately convex function.

CITATION LIST

Non Patent Literature

[Non Patent Literature 1] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: gradient descent without a gradient” [online], 30 Nov. 2004, [Search on Oct. 2, 2019], Internet <URL: http://www.cs.cmu.edu/~mcmahan/soda2005.pdf>

[Non Patent Literature 2] A. Schrijver, Theory of linear and integer programming, John Wiley & Sons, 1998.

[Non Patent Literature 3] E. Hazan and K. Levy, Bandit convex optimization: Towards tight bounds, In Advances in Neural Information Processing Systems, pages 784-792, 2014.

[Non Patent Literature 4] L. Lovasz and S. Vempala, The geometry of logconcave functions and sampling algorithms, [online], March 2005, [Search on Oct. 2, 2019], Internet <URL: https://www.cc.gatech.edu/~vempala/papers/logcon.pdf>

[Non Patent Literature 5] A. Saumard and J. A. Wellner, Log-concavity and strong log-concavity: a review, Statistics surveys, 8:45, 2014.

[Non Patent Literature 6] S. Arora, E. Hazan, and S. Kale, The multiplicative weights update method: a meta-algorithm and applications, Theory of Computing, 8(1):121-164, 2012.

[Non Patent Literature 7] A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin, Escaping the local minima via simulated annealing: Optimization of approximately convex functions, In Conference on Learning Theory, pages 240-265, 2015.

SUMMARY OF INVENTION

Technical Problem

However, the technique according to Non Patent Literature 1 has insufficient precision in online optimization in a case where values of an objective function are only partially acquired for the non-linear objective function.

The present disclosure has been made in order to solve such a problem, and an object thereof is to provide an optimization apparatus, an optimization method, and an optimization program for achieving high-precision optimization in online optimization in a case where values of an objective function are only partially acquired for a non-linear objective function.

Solution to Problem

An optimization apparatus according to a first aspect of the present disclosure includes:

a setting unit that sets a predetermined non-linear objective function;

a policy determination unit that determines a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;

a policy execution unit that acquires a reward as an execution result of the determined policy;

an update rate determination unit that determines an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and

an update unit that updates the non-linear objective function, based on the update rate.

An optimization method according to a second aspect of the present disclosure includes,

by a computer:

setting a predetermined non-linear objective function;

determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;

acquiring a reward as an execution result of the determined policy;

determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and

updating the non-linear objective function, based on the update rate.

An optimization program according to a third aspect of the present disclosure causes a computer to execute:

setting processing of setting a predetermined non-linear objective function;

policy determination processing of determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;

policy execution processing of acquiring a reward as an execution result of the determined policy;

update rate determination processing of determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and

update processing of updating the non-linear objective function, based on the update rate.

Advantageous Effects of Invention

According to the present invention, it is possible to provide an optimization apparatus, an optimization method, and an optimization program for achieving high-precision optimization in online optimization in a case where values of an objective function are only partially acquired for a non-linear objective function.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a configuration of an optimization apparatus according to a first example embodiment.

FIG. 2 is a flowchart illustrating a flow of an optimization method according to the first example embodiment.

FIG. 3 is a block diagram illustrating a configuration of an optimization apparatus according to a second example embodiment.

FIG. 4 is a flowchart illustrating a flow of an optimization method according to the second example embodiment.

DESCRIPTION OF EMBODIMENTS

Hereinafter, example embodiments of the present disclosure will be described in detail with reference to the drawings. In the drawings, the same or associated elements are denoted by the same reference numerals, and duplicate descriptions are omitted as necessary for clarity of description.

First Example Embodiment

FIG. 1 is a block diagram illustrating a configuration of an optimization apparatus 100 according to the first example embodiment. The optimization apparatus 100 is an information processing apparatus that performs online optimization in a bandit problem. Herein, the bandit problem is a problem setting in which the content of the objective function changes every time a solution (policy) is executed, and only the value (reward) of the objective function at the selected solution can be observed. Thus, online optimization in the bandit problem is online optimization in which values of the objective function are only partially acquired.

The optimization apparatus 100 includes a setting unit 110, a policy determination unit 120, a policy execution unit 130, an update rate determination unit 140, and an update unit 150. The setting unit 110 sets a predetermined non-linear objective function. The policy determination unit 120 determines a policy to be executed in the online optimization in the bandit problem, based on the non-linear objective function. The policy execution unit 130 acquires a reward as an execution result of the determined policy. The update rate determination unit 140 determines an update rate of a non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function. Herein, the multiplicative weight update method is, for example, a method disclosed in Non Patent Literature 6. The update unit 150 updates the non-linear objective function, based on the update rate.

FIG. 2 is a flowchart illustrating a flow of the optimization method according to the first example embodiment. First, the setting unit 110 sets a predetermined non-linear objective function (S1). Next, the policy determination unit 120 determines a policy to be executed in the online optimization in the bandit problem, based on the non-linear objective function (S2). Then, the policy execution unit 130 acquires a reward as an execution result of the determined policy (S3). Subsequently, the update rate determination unit 140 determines an update rate of the non-linear objective function by the multiplicative weight update method, based on the acquired reward and the non-linear objective function (S4). Thereafter, the update unit 150 updates the non-linear objective function, based on the update rate (S5).

As described above, in the present example embodiment, in the online optimization in a case where values of the objective function are only partially acquired for the non-linear objective function, the update rate by the multiplicative weight update method is determined from the determined policy, and the non-linear objective function is updated by the update rate. Therefore, high-precision optimization can be achieved.

The optimization apparatus 100 includes a processor, a memory, and a storage device as a configuration not illustrated. The storage device stores a computer program in which processing of the optimization method according to the present example embodiment is implemented. The processor reads the computer program from the storage device into the memory and executes the computer program. As a result, the processor achieves functions of the setting unit 110, the policy determination unit 120, the policy execution unit 130, the update rate determination unit 140, and the update unit 150.

Alternatively, each of the setting unit 110, the policy determination unit 120, the policy execution unit 130, the update rate determination unit 140, and the update unit 150 may be achieved by dedicated hardware. In addition, a part or all of each component of each apparatus may be achieved by a general-purpose or special-purpose circuitry, a processor, or the like, or a combination thereof. These may be configured by a single chip, or may be configured by a plurality of chips connected via a bus. A part or all of each component of each apparatus may be achieved by a combination of the above-described circuitry or the like with a program. As the processor, a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), or the like can be used.

When a part or all of the components of the optimization apparatus 100 are achieved by a plurality of information processing apparatuses, circuitries, and the like, the plurality of information processing apparatuses, circuitries, and the like may be concentratedly arranged or dispersedly arranged. For example, the information processing apparatus, the circuitry, and the like may be achieved as a client-server system, a cloud computing system, and the like, each of which is connected via a communication network. Functions of the optimization apparatus 100 may be provided in a software as a service (SaaS) format.

Second Example Embodiment

Bandit convex optimization (BCO) is an online decision-making framework with limited (partial) feedback. In this framework, a player is given a convex feasible region K satisfying K⊆R^(d) and the number of repetition times T of decision making. Herein, d is a positive integer indicating the number of dimensions of the feasible region. For each round t=1, 2, . . . , T, the player selects an action (policy) a_(t)∈K and, at the same time, the environment selects a convex function f_(t). The player observes the feedback f_(t)(a_(t)) prior to selecting the following policy a_(t+1). Herein, it is assumed that K has a positive volume, i.e.,

∫_(x∈K) 1 dx>0.  [Math. 1]

It is assumed that f_(t) is σ-strongly convex and β-smooth. In other words, the following expressions (1) and (2) are satisfied for all x and y∈K:

$f_{t}(y) \geq f_{t}(x) + \nabla f_{t}(x)^{T}(y - x) + \frac{\sigma}{2}\|y - x\|_{2}^{2},$  [Math. 2] . . . (1)

$f_{t}(y) \leq f_{t}(x) + \nabla f_{t}(x)^{T}(y - x) + \frac{\beta}{2}\|y - x\|_{2}^{2}.$  [Math. 3] . . . (2)

Performance of the player is evaluated with respect to Regret R_(T) (x*), which is an evaluation index. Herein, Regret R_(T) (x*) is defined by the following equation (3) with respect to x*∈K:

$R_{T}(x^{*}) = \sum_{t=1}^{T} f_{t}(a_{t}) - \sum_{t=1}^{T} f_{t}(x^{*}).$  [Math. 4] . . . (3)
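As a small illustration of equation (3), the following sketch computes the Regret for a short sequence of one-dimensional quadratic losses. The loss family f_(t)(x)=(x−c_(t))², the centres c_(t), the played actions, and the function name `regret` are assumptions made only for this example and do not appear in the present disclosure.

```python
def regret(losses, actions, x_star):
    """Regret R_T(x*) of equation (3): cumulative loss of the played
    actions minus the cumulative loss of the fixed comparator x*."""
    played = sum(f(a) for f, a in zip(losses, actions))
    fixed = sum(f(x_star) for f in losses)
    return played - fixed


# Hypothetical losses f_t(x) = (x - c_t)^2 (assumption for illustration only).
centres = [0.0, 1.0, 2.0]
losses = [lambda x, c=c: (x - c) ** 2 for c in centres]
actions = [0.5, 0.5, 1.5]               # actions a_1, a_2, a_3 of some player
x_star = sum(centres) / len(centres)    # minimiser of the summed losses

print(regret(losses, actions, x_star))
```

Because the comparator x* is fixed over all rounds while the actions may follow the changing functions, the Regret can be negative, as it is for these example values.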

In the present disclosure, a convex benchmark set K′⊆K is employed as the set against which the capability of the player is measured, and the Regret R_(T)(x*) for x*∈K′ is considered. In other words, attention is paid to the value of

sup_(x*∈K′)E[R_(T)(x*)],  [Math. 5]

that is, the expected gap between the cumulative loss of the outputs of the algorithm and that of the optimal fixed policy x* belonging to K′. Note that E[] denotes an expected value. When the optimal fixed policy x* that satisfies

$x^{*} \in \arg\min_{x \in K} \sum_{t=1}^{T} f_{t}(x)$  [Math. 6]

belongs to K′, the value of

sup_(x*∈K′)E[R_(T)(x*)]  [Math. 7]

is equal to the standard worst-case Regret,

sup_(x*∈K)E[R_(T)(x*)].  [Math. 8]

A point x∈R^(d) is called a γ-interior of K when ||y−x||₂≤γ implies y∈K. For example, when K is expressed by m linear inequalities, i.e., K is expressed as

K={x∈ℝ^(d) | a_(j)^(T)x≤b_(j) (j∈[m])},  [Math. 9]

the convex set K′ defined by

K′={x∈ℝ^(d) | a_(j)^(T)x≤b_(j)−r (j∈[m])}  [Math. 10]

consists of r-interiors of K.
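The construction of K′ from K can be sketched as follows for a small polytope. The sketch assumes that every row a_(j) has unit l₂ norm, which is not stated in the text but is what makes a shift of b_(j) by r correspond to an l₂ margin of r; the square region and the helper names `in_polytope` and `shrink` are likewise assumptions for illustration.

```python
def in_polytope(A, b, x, tol=1e-12):
    """Return True when a_j^T x <= b_j holds for every row j."""
    return all(
        sum(a * xi for a, xi in zip(row, x)) <= bj + tol
        for row, bj in zip(A, b)
    )


def shrink(b, r):
    """Right-hand sides b_j - r that define K' as in [Math. 10]."""
    return [bj - r for bj in b]


# Hypothetical K: the square [-1, 1]^2, written with unit-norm rows a_j.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 1.0, 1.0, 1.0]
r = 0.25
b_shrunk = shrink(b, r)

x = [0.75, -0.75]            # a corner of K'
moved = [x[0] + r, x[1]]     # displaced by l2 distance r
print(in_polytope(A, b_shrunk, x), in_polytope(A, b, moved))
```

In this example the displaced point leaves K′ but still lies in K, which is the property of an r-interior that the later feasibility argument relies on.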

For a generic benchmark set K′⊆K, let r≥0 be a non-negative real value such that all members of K′ are r-interiors of K. In the special case of K′=K, r is equal to zero. It is further assumed that there is a positive number R>0 such that the l₂ norm of any element of K′ is at most R.

Access to a membership oracle for K′ is also assumed. This means that, given x∈R^(d), it can be determined whether x∈K′ by calling the membership oracle. When K′ is expressed by m inequalities as

K′={x∈ℝ^(d) | g_(j)(x)≤0 (j∈[m])},  [Math. 11]

a membership oracle for K′ is available. This is because whether x∈K′ can be checked by evaluating g_(j)(x) for j∈[m]. Furthermore, it is known from Non Patent Literature 2 that, when a linear optimization problem on K′ can be solved in polynomial time, a polynomial-time membership oracle for K′ exists.
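A membership oracle of the kind described above can be sketched as a function that evaluates every constraint g_(j). The two constraint functions below (a unit disc intersected with a half-plane) and the name `make_membership_oracle` are hypothetical and serve only to illustrate the idea.

```python
def make_membership_oracle(constraints):
    """Build an oracle for K' = {x | g_j(x) <= 0 for all j} as in [Math. 11]."""
    def oracle(x):
        return all(g(x) <= 0.0 for g in constraints)
    return oracle


# Hypothetical constraints (assumptions for illustration only).
constraints = [
    lambda x: x[0] ** 2 + x[1] ** 2 - 1.0,   # inside the unit disc
    lambda x: -x[0],                         # half-plane x_1 >= 0
]
oracle = make_membership_oracle(constraints)

print(oracle([0.5, 0.5]), oracle([-0.5, 0.0]))
```

The cost of one call is m constraint evaluations, so the oracle is inexpensive whenever each g_(j) can be evaluated cheaply.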

<Notation>

For a vector x=(x₁, . . . , x_(d))^(T)∈R^(d), the l₂ norm of x is denoted by ||x||₂. In other words, it is as follows.

||x||₂=√(x^(T)x)=√(Σ_(i=1)^(d) x_(i)²)  [Math. 12]

For a matrix X∈R^(d×d), the l₂-operator norm is denoted by ||X||₂. In other words, ||X||₂=max{||Xy||₂ | y∈R^(d), ||y||₂=1}. When X is a symmetric matrix, ||X||₂ is equal to the largest absolute value of an eigenvalue of X. Given a positive semidefinite matrix A∈R^(d×d) and a vector x∈R^(d), ||x||_(A) is defined as follows:

||x||_(A)=√(x^(T)Ax)=||A^(1/2)x||₂.  [Math. 13]

Similarly, for a matrix X∈R^(d×d), ||X||_(A) is defined as follows:

||X||_(A)=||A^(1/2)XA^(1/2)||₂.  [Math. 14]

<Smoothed Convex Function>

Let v and u be random variables that follow uniform distributions on B^(d)={v∈R^(d) | ||v||₂≤1} and S^(d)={u∈R^(d) | ||u||₂=1}, respectively. For a convex function f on R^(d) and a regular (invertible) matrix B∈R^(d×d), a smoothed function f̂_(B) is defined by the following equation (4):

f̂_(B)(x)=E[f(x+Bv)]  [Math. 15] . . . (4).

<Auxiliary Theorem 1 (Non Patent Literature 3)>

A gradient of f̂_(B) is expressed by the following equation (5), in which the expected value is taken over u:

∇f̂_(B)(x)=E[d·f(x+Bu)B⁻¹u]  [Math. 16] . . . (5).

When f is β-smooth, the following expression (6) holds:

$0 \leq \hat{f}_{B}(x) - f(x) \leq \frac{\beta}{2}\|B^{T}B\|_{2} = \frac{\beta}{2}\lambda_{1}(B^{T}B).$  [Math. 17] . . . (6)

When f is σ-strongly convex, f̂_(B) is also σ-strongly convex.

The equation (5) follows from Stokes' theorem, and the expression (6) is derived from the definition of β-smoothness. In the bandit feedback setting, an unbiased estimated value of the gradient of f_(t) itself cannot be utilized, but an unbiased estimated value of the gradient of the smoothed function f̂_(t) can be constructed based on the equation (5). The difference between f_(t) and f̂_(t) is bounded by the expression (6).
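The one-point gradient estimate behind equation (5) can be checked numerically with a short Monte Carlo sketch. The sketch evaluates f at x+Bu with u drawn uniformly from the unit sphere, as in equation (12). The test function f(x)=||x||₂²/2, the choice B=bI, the sample count, and all function names are assumptions for illustration; for this particular f the gradient of the smoothed function equals x, so the sample average should be close to x.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Uniform sample on S^d = {u : ||u||_2 = 1} via a normalised Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g]


def gradient_estimate(f, x, b, u):
    """One-point estimate d * f(x + B u) * B^{-1} u for the assumed B = b * I."""
    d = len(x)
    value = f([xi + b * ui for xi, ui in zip(x, u)])
    return [d * value * ui / b for ui in u]


def f(x):
    """Hypothetical smooth convex test function f(x) = ||x||_2^2 / 2."""
    return 0.5 * sum(xi * xi for xi in x)


rng = random.Random(0)          # fixed seed, chosen only for reproducibility
x, b, n = [1.0, 0.5], 0.5, 200_000
mean = [0.0, 0.0]
for _ in range(n):
    est = gradient_estimate(f, x, b, sample_unit_sphere(len(x), rng))
    mean = [m + e / n for m, e in zip(mean, est)]

print(mean)   # sample average of the estimates; expected to lie near x
```

Each estimate uses a single function value, which is all that bandit feedback provides, at the price of a variance that grows as b becomes small.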

<Log-Concave Distribution>

A probability distribution on a convex set K⊆R^(d) is called a log-concave distribution when its probability density function p:K→R is expressed as p(x)=exp(−g(x)) by using a convex function g:K→R, i.e., when the logarithm of p(x) is a concave function. The algorithm of the present disclosure maintains a log-concave distribution. Random samples from log-concave distributions can be generated efficiently under mild assumptions. In fact, as illustrated in Non Patent Literature 4, there is a computationally efficient MCMC algorithm for sampling from p when a membership oracle for K and an evaluation oracle for g are given. Thus, the present disclosure can efficiently compute estimated values of the mean μ(p) and the covariance matrix Cov(p) of p. The following auxiliary theorems are useful when bounding the variance of a log-concave distribution.

<Auxiliary Theorem 2 (Prop. 10.1 of Non Patent Literature 5)>

It is assumed that a log-concave distribution on K has a probability density function p(x)=exp(−g(x)), where g is a σ-strongly convex function. In this case, the covariance matrix Σ of p satisfies ||Σ||₂≤1/σ.

In order to ensure that an output a_(t) of the algorithm of the present disclosure is included in K, the following auxiliary theorem is used.

<Auxiliary Theorem 3 (Non Patent Literature 4)>

Let p be a log-concave distribution on K. In this case, the following ellipsoid is included in K:

{x∈ℝ^(d) | ||x−μ(p)||_(Cov(p)⁻¹)≤1/e}  [Math. 18]

<Configuration of Optimization Apparatus>

FIG. 3 is a block diagram illustrating a configuration of an optimization apparatus 200 according to the second example embodiment. The optimization apparatus 200 is an information processing apparatus which is a specific example of the optimization apparatus 100 described above. The optimization apparatus 200 includes a storage unit 210, a memory 220, an interface (IF) unit 230, and a control unit 240.

The storage unit 210 is a storage device such as a hard disk or a flash memory. The storage unit 210 stores an exploration parameter 211 and an optimization program 212. The exploration parameter 211 is a parameter used when determining a policy for each round, which will be described later. As the exploration parameter 211, a parameter group calculated for each round may be input from outside and stored. Alternatively, the exploration parameter 211 may be calculated for each round in the optimization apparatus 200. The optimization program 212 is a computer program in which the optimization method according to the present example embodiment is implemented.

The memory 220 is a volatile storage device such as a random access memory (RAM), and is a storage area for temporarily holding information during an operation of the control unit 240. The IF unit 230 is an interface for inputting and outputting data to and from the outside of the optimization apparatus 200. For example, the IF unit 230 receives input data from another computer or the like via a network (not illustrated), and outputs the received input data to the control unit 240. The IF unit 230 outputs data to a destination computer via a network in response to an instruction from the control unit 240. Alternatively, the IF unit 230 receives a user operation via an input device (not illustrated) such as a keyboard, a mouse, or a touch panel, and outputs the received operation content to the control unit 240. In addition, the IF unit 230 performs the output to a touch panel, a display device, a printer, or the like (not illustrated), in response to an instruction from the control unit 240.

The control unit 240 is a processor such as a central processing unit (CPU), and controls each configuration of the optimization apparatus 200. The control unit 240 reads the optimization program 212 from the storage unit 210 into the memory 220, and executes the optimization program 212. As a result, the control unit 240 achieves functions of the setting unit 241, the policy determination unit 242, the policy execution unit 243, the update rate determination unit 244, and the update unit 245. The setting unit 241, the policy determination unit 242, the policy execution unit 243, the update rate determination unit 244, and the update unit 245 are examples of the setting unit 110, the policy determination unit 120, the policy execution unit 130, the update rate determination unit 140, and the update unit 150, respectively, which have been described above.

The setting unit 241 performs initial setting of a predetermined non-linear objective function. In addition, the setting unit 241 receives input of the exploration parameter 211 or a calculation formula of the exploration parameter 211 from the outside as necessary, and stores the received data in the storage unit 210.

The policy determination unit 242 determines a policy to be executed in the online optimization in the bandit problem, based on the non-linear objective function. Herein, the policy determination unit 242 determines a policy by further using the exploration parameter 211. Furthermore, the policy determination unit 242 may calculate (update, select) the exploration parameter 211, based on the number of trials of updating in the non-linear objective function, and may determine the policy by further using the calculated exploration parameter 211. Further, the policy determination unit 242 may calculate the exploration parameter 211, based on a distance from a boundary of a feasible region. For example, the policy determination unit 242 calculates the exploration parameter 211 for each round by using the above input calculation formula. Alternatively, the policy determination unit 242 may use the exploration parameter 211 calculated outside in advance in accordance with the number of rounds. In addition, the policy determination unit 242 calculates a mean value and a covariance matrix of a plurality of samples generated based on the estimated value of the non-linear objective function, and determines a policy by further using the mean value and the covariance matrix.

The policy execution unit 243 executes the policy determined by the policy determination unit 242 as an input of a non-linear objective function, and acquires a reward as an execution result (value of the objective function).

The update rate determination unit 244 determines an update rate of the non-linear objective function by the multiplicative weight update method, based on the reward acquired by the policy execution unit 243 and the non-linear objective function.

The update unit 245 updates the non-linear objective function, based on the update rate determined by the update rate determination unit 244.

<Flow of Optimization Method>

FIG. 4 is a flowchart illustrating a flow of the optimization method according to the second example embodiment. First, as a precondition, an upper limit value T∈N of rounds (the number of repetition times), a learning rate η>0, a strong-convexity parameter σ>0, and a membership oracle MO for K′ are assumed to be given. It is also assumed that the exploration parameters α_(t) satisfy the following:

{α_(t)}_(t=1)^(T)⊆ℝ_(>0).  [Math. 19]

In addition, the optimization apparatus 200 holds, in the memory 220, a function z_(t) on K′ based on the multiplicative weights update method according to Non Patent Literature 6.

First, the setting unit 241 initializes z_(t), which is (an estimated value of) the non-linear objective function, as described below (S201).

$z_{1}(x) = \frac{\sigma}{2}\|x\|_{2}^{2}.$  [Math. 20]

For each round, p_(t) is a probability distribution on K′ with a density proportional to exp(−ηz_(t)(x)). In other words, Z_(t) and p_(t) are defined by the following expression (7):

$Z_{t} := \int_{x \in K'} \exp(-\eta z_{t}(x))\,dx, \quad p_{t}(x) = \frac{\exp(-\eta z_{t}(x))}{Z_{t}}.$  [Math. 21] . . . (7)

Next, the control unit 240 increments t by one from round t=1 to T, and repeats the following steps S203 to S210 (S202).

First, the policy determination unit 242 generates samples x_(t)^((1)), . . . , x_(t)^((M)) from p_(t). Herein, M is the number of samples generated, and M≥1. Note that a method of generating samples from p_(t) will be described later. Next, the policy determination unit 242 calculates, from the generated samples x_(t)^((1)), . . . , x_(t)^((M)), an estimated value μ̂_(t) of the mean and an estimated value Σ̂_(t) of the covariance matrix by the following equation (9) in such a way as to satisfy the following expression (8) (S203). Note that μ_(t) and Σ_(t) represent the mean and the covariance matrix of p_(t).

$\|\hat{\mu}_{t} - \mu_{t}\|_{\Sigma_{t}^{-1}} \leq 1/9, \quad \|\hat{\Sigma}_{t} - \Sigma_{t}\|_{\Sigma_{t}^{-1}} \leq 1/9, \quad E[\hat{\mu}_{t} \mid \mu_{t}] = \mu_{t}.$  [Math. 22] . . . (8)

$\hat{\mu}_{t} = \frac{1}{M}\sum_{j=1}^{M} x_{t}^{(j)}, \quad \hat{\Sigma}_{t} = \frac{1}{M}\sum_{j=1}^{M} (x_{t}^{(j)} - \hat{\mu}_{t})(x_{t}^{(j)} - \hat{\mu}_{t})^{T}.$  [Math. 23] . . . (9)

Herein, when M is set to be sufficiently large, the expression (8) holds with a high probability.
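The estimates of equation (9) can be sketched in a few lines. The four two-dimensional samples and the function name `sample_mean_and_covariance` are hypothetical; note that the covariance estimate divides by M, as in equation (9), rather than by M−1.

```python
def sample_mean_and_covariance(samples):
    """Return (mu_hat, Sigma_hat) of equation (9) for a list of d-dimensional samples."""
    m = len(samples)
    d = len(samples[0])
    mu = [sum(x[i] for x in samples) / m for i in range(d)]
    cov = [
        [sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in samples) / m for j in range(d)]
        for i in range(d)
    ]
    return mu, cov


# Hypothetical samples x_t^(1), ..., x_t^(4) (assumption for illustration only).
samples = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu_hat, sigma_hat = sample_mean_and_covariance(samples)
print(mu_hat, sigma_hat)
```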

Next, the policy determination unit 242 calculates a matrix B_(t)∈R^(d×d) in such a way as to satisfy the following:

B_(t)^(T)B_(t)=Σ̂_(t).  [Math. 24]

For example, the policy determination unit 242 can calculate the matrix B_(t) by a Cholesky decomposition algorithm.
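One way to obtain such a matrix B_(t) is sketched below. A textbook Cholesky factorisation yields a lower-triangular L with LL^(T)=Σ̂_(t), so taking B_(t)=L^(T) gives B_(t)^(T)B_(t)=Σ̂_(t). The sketch assumes that Σ̂_(t) is symmetric positive definite; the example matrix and the function names are hypothetical.

```python
import math


def cholesky_lower(S):
    """Lower-triangular L with L L^T = S, for a symmetric positive definite S."""
    d = len(S)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def factor_B(S):
    """Upper-triangular B = L^T, so that B^T B = S as required by [Math. 24]."""
    L = cholesky_lower(S)
    d = len(S)
    return [[L[j][i] for j in range(d)] for i in range(d)]


# Hypothetical covariance estimate (assumption for illustration only).
sigma_hat = [[4.0, 2.0], [2.0, 3.0]]
B = factor_B(sigma_hat)
print(B)
```

A triangular B_(t) is convenient later because products with its inverse can be computed by substitution instead of an explicit matrix inversion.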

The policy determination unit 242 selects the exploration parameter α_(t) (S205). A method of selecting the exploration parameter α_(t) will be described later.

Then, the policy determination unit 242 smooths the function f_(t) by using B=α_(t)B_(t) in the above equation (4), and defines the smoothed function f̂_(t), which is an expected value of the function, by the following expression (10):

f̂_(t)(x):=E[f_(t)(x+α_(t)B_(t)v)]  [Math. 25] . . . (10).

Herein, v is uniformly distributed on B^(d)={v∈R^(d) | ||v||₂≤1} as described above.

Then, the policy determination unit 242 selects u_(t) uniformly at random from the unit sphere S^(d)={u∈R^(d) | ||u||₂=1} (S206). Then, the policy determination unit 242 calculates (determines) a policy a_(t) by a_(t)=μ̂_(t)+α_(t)B_(t)u_(t) by using the estimated mean μ̂_(t), the exploration parameter α_(t), the matrix B_(t), and the selected u_(t) (S207).
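Steps S206 and S207 can be sketched as follows. Drawing a standard Gaussian vector and normalising it is a common way to sample uniformly from the unit sphere; the numerical values of the estimated mean, of B_(t), and of α_(t), as well as the function names, are assumptions for illustration.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw u uniformly from S^d = {u : ||u||_2 = 1} by normalising a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g]


def determine_policy(mu_hat, alpha, B, u):
    """Return the policy mu_hat + alpha * B u (steps S206 and S207)."""
    Bu = [sum(row[k] * u[k] for k in range(len(u))) for row in B]
    return [m + alpha * v for m, v in zip(mu_hat, Bu)]


rng = random.Random(0)            # fixed seed, chosen only for reproducibility
mu_hat = [0.2, -0.1]              # hypothetical estimated mean
B = [[0.3, 0.1], [0.0, 0.2]]      # hypothetical factor of the covariance estimate
alpha = 1.0 / 9.0                 # hypothetical exploration parameter
u = sample_unit_sphere(2, rng)
a = determine_policy(mu_hat, alpha, B, u)
print(a)
```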

Then, the policy execution unit 243 executes the determined policy a_(t) and acquires the execution result. Specifically, the policy execution unit 243 inputs the policy a_(t) to the function f_(t)(x), and observes the output value f_(t)(a_(t)) of the function (S208).

Based on this observation, the update rate determination unit 244 calculates an update rate ĝ_(t)∈R^(d) by the following equation (11) (S209):

ĝ_(t)=d·f_(t)(a_(t))(α_(t)B_(t))⁻¹u_(t)  [Math. 26] . . . (11).

This is a random estimated value of the gradient ∇f̂_(t)(μ̂_(t)). In other words, given μ̂_(t) and B_(t), the conditional expected value of ĝ_(t) satisfies the following equation (12):

E[ĝ_(t)]=E[d·f_(t)(μ̂_(t)+α_(t)B_(t)u_(t))(α_(t)B_(t))⁻¹u_(t)]=∇f̂_(t)(μ̂_(t))  [Math. 27] . . . (12),

where the second equality is acquired from the above equation (5).
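The update rate of equation (11) can be sketched as below. The sketch assumes that B_(t) is upper triangular (as it would be if taken as the transpose of a Cholesky factor), so that B_(t)⁻¹u_(t) is obtained by back substitution; this is an implementation choice for the example, not a requirement stated in the text. The numerical values and function names are hypothetical.

```python
def solve_upper_triangular(U, y):
    """Solve U x = y by back substitution for an upper-triangular U."""
    d = len(y)
    x = [0.0] * d
    for i in reversed(range(d)):
        s = sum(U[i][k] * x[k] for k in range(i + 1, d))
        x[i] = (y[i] - s) / U[i][i]
    return x


def update_rate(reward, alpha, B, u):
    """Equation (11): d * f_t(a_t) * (alpha * B)^{-1} u, with reward = f_t(a_t)."""
    d = len(u)
    w = solve_upper_triangular(B, u)        # w = B^{-1} u
    return [d * reward * wi / alpha for wi in w]


# Hypothetical values (assumptions for illustration only).
B = [[2.0, 1.0], [0.0, 1.0]]
u = [0.0, 1.0]          # a point on the unit sphere
alpha = 0.5
reward = 3.0            # observed value f_t(a_t)
g_hat = update_rate(reward, alpha, B, u)
print(g_hat)
```

Only the single observed reward enters the computation, which matches the partial feedback available in the bandit setting.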

The update unit 245 updates z_(t) by using the random estimated value ĝ_(t) as illustrated in the following equation (13) (S210):

$z_{t+1}(x) = z_{t}(x) + \hat{g}_{t}^{T}(x - \hat{\mu}_{t}) + \frac{\sigma}{2}\|x - \hat{\mu}_{t}\|_{2}^{2}.$  [Math. 28] . . . (13)
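Because the initial function z₁ and every term added by equation (13) are quadratic in x, z_(t) remains a quadratic function whose minimiser is the centre θ_(t) used in the sampling example of this embodiment. The following sketch evaluates z_(t) directly from a history of pairs (ĝ_(j), μ̂_(j)) and computes that centre; the history values and the function names are hypothetical.

```python
def z_value(x, sigma, history):
    """Evaluate z_t(x) from z_1 and the (g_hat_j, mu_hat_j) pairs added by equation (13)."""
    value = 0.5 * sigma * sum(xi * xi for xi in x)          # z_1(x)
    for g_hat, mu_hat in history:
        diff = [xi - mi for xi, mi in zip(x, mu_hat)]
        value += sum(gi * di for gi, di in zip(g_hat, diff))
        value += 0.5 * sigma * sum(di * di for di in diff)
    return value


def centre(sigma, history, d):
    """theta_t = (1/t) * sum_{j<t} (mu_hat_j - g_hat_j / sigma), with t = len(history) + 1."""
    t = len(history) + 1
    acc = [0.0] * d
    for g_hat, mu_hat in history:
        acc = [a + m - g / sigma for a, m, g in zip(acc, mu_hat, g_hat)]
    return [a / t for a in acc]


# Hypothetical history after two updates, i.e. t = 3 (assumption for illustration only).
sigma = 2.0
history = [([1.0, -1.0], [0.5, 0.5]), ([0.0, 2.0], [1.0, 0.0])]
theta = centre(sigma, history, 2)
x = [1.0, 1.0]
gap = z_value(x, sigma, history) - z_value(theta, sigma, history)
print(theta, gap)
```

For these values the gap z_(t)(x)−z_(t)(θ_(t)) agrees with (σt/2)||x−θ_(t)||₂², so storing t and θ_(t) is sufficient to represent z_(t) up to an additive constant.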

Thereafter, the control unit 240 determines whether the round t is equal to T (S211). When t is less than T, the control unit 240 returns to the step S202, sets t=t+1, and executes the step S203 and subsequent steps. When t is equal to T in the step S211, the present optimization processing ends.

<Example of Method of Generating Samples from p_(t)>

A simple example of a method of generating samples from p_(t) is to use a normal distribution. Herein, since p_(t) is defined by the above z₁ (x), expression (7), and equation (13), the distribution p_(t) is a multi-dimensional cut normal distribution on K expressed as follows:

$\begin{matrix} {{{p_{t}(x)} \propto {{\exp\left( {- \frac{{{\sigma\eta}t}{{x - \theta_{t}}}_{2}^{2}}{2}} \right)}\left( {x \in K^{\prime}} \right)}},} & \left\lbrack {{Math}.29} \right\rbrack \end{matrix}$ p_(t)(x) = 0(x ∈ ℝ^(d) ∖ K^(′)), where $\begin{matrix} {\theta_{t} = {\frac{1}{t}{\sum_{j = 1}^{t - 1}{\left( {{\hat{\mu}}_{j} - {\frac{1}{\sigma}{\hat{g}}_{j}}} \right).}}}} & \left\lbrack {{Math}.30} \right\rbrack \end{matrix}$

Therefore, by sampling x from a normal distribution

$\begin{matrix} {N\left( {\theta_{t},{\frac{1}{{\sigma\eta}t}I}} \right)} & \left\lbrack {{Math}.31} \right\rbrack \end{matrix}$

until x∈K′ holds, a sample x that follows p_(t) can be acquired.

However, although the above processing is sufficiently practical in many cases, it does not necessarily end in polynomial time. In such a case, since p_(t) is a log-concave distribution, a polynomial-time sampling method based on MCMC (Non Patent Literature 4) can be applied instead of the above processing. At this time, the membership oracle may be called. Note that, as a more efficient method of computing μ{circumflex over ( )} and Σ{circumflex over ( )} and sampling from p_(t), the technique of Non Patent Literature 7 can be used.
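The simple sampling procedure described above (sampling from the normal distribution and keeping only a point inside K′) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the `is_member` callable standing in for the membership oracle, and the `max_tries` cutoff are all hypothetical and not part of the disclosure.

```python
import math
import random


def theta_t(mu_hats, g_hats, sigma, d):
    """Center theta_t of [Math. 30]; the lists hold rounds 1..t-1, so t = len + 1."""
    t = len(mu_hats) + 1
    return [
        sum(mu[i] - g[i] / sigma for mu, g in zip(mu_hats, g_hats)) / t
        for i in range(d)
    ]


def sample_p_t(theta, t, eta, sigma, is_member, rng, max_tries=100000):
    """Rejection sampling: draw from N(theta, I / (sigma*eta*t)) until x is in K'."""
    std = 1.0 / math.sqrt(sigma * eta * t)
    for _ in range(max_tries):
        x = [rng.gauss(m, std) for m in theta]
        if is_member(x):  # membership oracle call (assumed interface)
            return x
    raise RuntimeError("no sample accepted; an MCMC-based sampler may be needed")


# Example: K' is taken to be the unit ball purely for illustration.
in_unit_ball = lambda x: sum(c * c for c in x) <= 1.0
center = theta_t([[1.0, 1.0]], [[2.0, 0.0]], 2.0, 2)  # [0.0, 0.5]
sample = sample_p_t(center, 2, 1.0, 2.0, in_unit_ball, random.Random(0))
```

The cutoff reflects the remark above that this processing does not necessarily end in polynomial time; when acceptance is rare, the MCMC-based method would be the alternative.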

<Example of Method of Selecting Exploration Parameter α_(t)>

In the step S205 described above, the policy determination unit 242 needs to select α_(t) in such a way that a_(t)=μ{circumflex over ( )}_(t)+α_(t)B_(t)u_(t) is a feasible solution, i.e., a_(t)∈K. The following proposition provides a sufficient condition for this.

<Proposition>

When α_(t) is bounded by

$\begin{matrix} {{0 < \alpha_{t} \leq {\frac{1}{9} + {r\sqrt{\frac{t{\eta\sigma}}{2}}}}},} & \left\lbrack {{Math}.32} \right\rbrack \end{matrix}$

a_(t)=μ{circumflex over ( )}_(t)+α_(t)B_(t)u_(t) is within K.

<Proof>

Let α_(t1) and α_(t2) be positive numbers that satisfy

α_(t1)≤1/9, α_(t2)≤r √{square root over (tησ/2)}  [Math. 33]

and α_(t)=α_(t1)+α_(t2). Then, a_(t) is expressed as a_(t)=μ{circumflex over ( )}_(t)+α_(t1)B_(t)u_(t)+α_(t2)B_(t)u_(t), and since all points of K′ are r-interiors of K, it suffices to show (i) μ{circumflex over ( )}_(t)+α_(t1)B_(t)u_(t)∈K′ and (ii) ||α_(t2)B_(t)u_(t)||₂≤r.

From Auxiliary Theorem 3,

||{circumflex over (μ)}_(t)+α_(t1)B_(t)u_(t)−μ_(t)||_(Σ) _(t) ⁻¹ ≤1/e  [Math. 34]

implies μ{circumflex over ( )}_(t)+α_(t1)B_(t)u_(t)∈K′. From the triangle inequality, the following expression (14) is acquired.

$\begin{matrix} \left\lbrack {{Math}.35} \right\rbrack &  \\ {\left\| \hat{\mu}_{t} + \alpha_{t1}B_{t}u_{t} - \mu_{t} \right\|_{\Sigma_{t}^{- 1}} \leq \left\| \hat{\mu}_{t} - \mu_{t} \right\|_{\Sigma_{t}^{- 1}} + \alpha_{t1}\left\| B_{t}u_{t} \right\|_{\Sigma_{t}^{- 1}} \leq \frac{1}{9} + \frac{1}{9}\left\| \Sigma_{t}^{- 1/2}B_{t}u_{t} \right\|_{2} \leq \frac{1}{9}\left( 1 + \left\| \Sigma_{t}^{- 1/2}B_{t} \right\|_{2} \right)} & (14) \end{matrix}$

From the expression (8),

||Σ_(t) ^(−1/2)B_(t)||₂≤2  [Math. 36]

is derived. By combining this with the expression (14), the following bound, which implies that (i) holds, is acquired:

||{circumflex over (μ)}_(t)+α_(t1)B_(t)u_(t)−μ_(t)||_(Σ) _(t) ⁻¹ ≤1/3≤1/e  [Math. 37]

Herein, since ηz_(t)(x) is a (tησ)-strongly convex function, the covariance matrix Σ_(t)=Cov(p_(t)) is bounded as

||Σ_(t)||₂≤1/(tησ)  [Math. 38]

by the Auxiliary Theorem 2. From this and the expression (8),

||{circumflex over (Σ)}_(t)||₂≤2/(tησ)  [Math. 39]

is acquired. Therefore, the following expression (15) is acquired:

||B_(t)||₂≤√{square root over (||{circumflex over (Σ)}_(t)||₂)}≤√{square root over (2/(tησ))}  [Math. 40] . . . (15).

From the expression (15) and

||u_(t)||₂=1, and α_(t2)≤r√{square root over (tησ/2)},  [Math. 41]

(ii) is shown.
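For reference, the chain of inequalities behind (ii) can be written out as below. This is a worked restatement that uses only the expression (15) and the bound on α_(t2), and assumes that u_(t) is a unit vector, i.e., ||u_(t)||₂=1.

```latex
\left\| \alpha_{t2} B_t u_t \right\|_2
  \le \alpha_{t2} \left\| B_t \right\|_2 \left\| u_t \right\|_2
  \le r \sqrt{\frac{t\eta\sigma}{2}} \cdot \sqrt{\frac{2}{t\eta\sigma}} \cdot 1
  = r .
```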

Based on the above, it is desirable that the exploration parameter α_(t) is selected by the following equation (16). The equation (16) is an example of calculation formulas of the exploration parameter α_(t) described above:

$\begin{matrix} \left\lbrack {{Math}.42} \right\rbrack &  \\ {\alpha_{t} = {\min{\left\{ {{\frac{1}{9} + {r\sqrt{\frac{t{\eta\sigma}}{2}}}},\sqrt{d}} \right\}.}}} & (16) \end{matrix}$

Namely, the exploration parameter α_(t) is calculated based on the round t. For example, in the step S205, the policy determination unit 242 may calculate the exploration parameter α_(t) by applying the current round t to the equation (16). Alternatively, when the exploration parameter α_(t) for each round t is calculated in advance outside the optimization apparatus 200 and already acquired as the exploration parameter 211, the policy determination unit 242 may select and read out the exploration parameter α_(t) associated to the round t at that time from the exploration parameters 211 in the storage unit 210.

Further, r is less than the shortest distance from the optimal solution to a boundary of the feasible region K. In other words, it indicates that a ball having a radius r centered on the optimal solution is contained in the feasible region K. Thus, r can be referred to as an index indicating how far inside the feasible region K the optimal solution exists. In addition, it can be said that, in the step S205, the policy determination unit 242 applies the round t at that time and r to the equation (16) and calculates the exploration parameter α_(t).
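The calculation of equation (16) can be sketched as follows. The function name and argument order are hypothetical; η, σ, r, and d are assumed to be given as plain numbers.

```python
import math


def exploration_parameter(t, eta, sigma, r, d):
    """Equation (16): alpha_t = min{1/9 + r*sqrt(t*eta*sigma/2), sqrt(d)}."""
    return min(1.0 / 9.0 + r * math.sqrt(t * eta * sigma / 2.0), math.sqrt(d))


# Precomputing alpha_t for rounds 1..T, in the manner of the exploration
# parameters 211 (the concrete values of eta, sigma, r, d are illustrative only).
T = 5
table = [exploration_parameter(t, 0.5, 4.0, 1.0, 4) for t in range(1, T + 1)]
```

With these illustrative values, α_(t) grows with the round t until it is capped at the square root of d, so a table computed in advance only needs to be indexed by the round.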

<Effect>

The Regret Limit in the above-mentioned Non Patent Literature 1 is

Õ(d^(2/3)T^(2/3)),  [Math. 43]

but a Regret Limit in this disclosure is

Õ(d√{square root over (T)}).  [Math. 44]
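As a rough comparison of the two Regret Limits above, the ratio of their leading terms can be computed as follows. This ignores the constants and logarithmic factors hidden in the Õ notation, so it is only an indication of scaling.

```latex
\frac{d^{2/3}\, T^{2/3}}{d \sqrt{T}}
  = d^{-1/3}\, T^{1/6}
  = \left( \frac{T}{d^{2}} \right)^{1/6}
```

The ratio exceeds 1 exactly when T > d². For example, with d = 10 and T = 10⁸, the ratio is (10⁶)^(1/6) = 10.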

In Non Patent Literature 3, self-concordant barriers cannot be constructed efficiently for general convex sets. In addition, with a v-self-concordant barrier, there is a gap of

√{square root over (v)}  [Math. 45]

between the upper and lower limits of the regret, because the Regret Limit of Non Patent Literature 3 is

Õ(d√{square root over (vT)}).  [Math. 46]

Because v is generally at least d for any compact convex set K, compared with the lower limit of

Ω(d√{square root over (T)}),  [Math. 47]

there is a gap of

Ω(√{square root over (d)}).  [Math. 48]

Moreover, when no self-concordant barrier with a small v is available, for example, when K is expressed by m(>>d) linear inequalities, the gap is even worse.

In contrast, the present disclosure can overcome the above-mentioned problems.

(i) Under mild assumptions, the algorithm of the present disclosure accomplishes, up to logarithmic factors, a minimax optimal Regret of

Õ(d√{square root over (T)}).  [Math. 49]

This result is the first tight bound for bandit convex optimization that applies to constrained problems. More precisely, under the assumption that optimal solutions exist in r-interiors, the algorithm of the present disclosure acquires a Regret Limit of

Õ(d√{square root over (T)}+d²/r²).  [Math. 50]

Moreover, even in a case of the absence of interior assumptions, the algorithm has a Regret Limit of

Õ(d^(3/2)√{square root over (T)}),  [Math. 51]

which is at least better than known algorithms.

(ii) The algorithm of the present disclosure does not require a self-concordant barrier. In fact, it is only assumed that the algorithm has access to the membership oracle for a feasible region. This means that the algorithm of the present disclosure works well even if K is expressed by an exponentially large number of linear inequalities, or even if no explicit form of K is given.

In addition, by using efficient algorithms for sampling from log-concave distributions, the algorithm of the present disclosure can be executed in polynomial time.
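As an illustration of what a membership oracle may look like, the following sketch builds one for a feasible region given by linear inequalities. The function name `make_polytope_oracle` and the list-based representation are hypothetical; any procedure that answers whether a point belongs to the region could serve the same role.

```python
def make_polytope_oracle(rows, bounds):
    """Return is_member(x) for K = {x : sum_j rows[i][j] * x[j] <= bounds[i] for all i}."""
    def is_member(x):
        return all(
            sum(a * xj for a, xj in zip(row, x)) <= b
            for row, b in zip(rows, bounds)
        )
    return is_member


# Example: the unit square [0, 1]^2 written as four linear inequalities.
square = make_polytope_oracle(
    [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]],
    [1.0, 0.0, 1.0, 0.0],
)
inside = square([0.5, 0.5])   # True
outside = square([1.5, 0.5])  # False
```

Only the yes/no answer is used by the caller, which is why no barrier function or other explicit description of K needs to be supplied.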

Other Example Embodiments

In the above example embodiment, the hardware configuration has been described, but the present invention is not limited thereto. The present disclosure is also able to achieve any processing by causing a central processing unit (CPU) to execute a computer program.

In the above examples, programs may be stored and provided to a computer by using various types of non-transitory computer readable media. Non-transitory computer readable media include various types of tangible storage media. Examples of the non-transitory computer readable media include a magnetic recording medium (e.g., a flexible disk, a magnetic tape, a hard disk drive), a magneto-optical recording medium (e.g., a magneto-optical disk), a compact disc read only memory (CD-ROM), a CD-R, a CD-R/W, a digital versatile disc (DVD), and a semiconductor memory (e.g., a mask ROM, a programmable ROM (PROM), an erasable PROM (EPROM), a flash ROM, a random access memory (RAM)). The program may also be supplied to the computer by various types of transitory computer readable media. Examples of the transitory computer-readable media include electrical signals, optical signals, and electromagnetic waves. The transitory computer readable medium may provide the program to the computer via wired communication paths, such as an electrical wire and an optical fiber, or a wireless communication path.

Note that the present disclosure is not limited to the above-mentioned example embodiments, and can be modified as appropriate within a range not deviating from the gist. The present disclosure may be implemented by appropriately combining respective example embodiments.

A part or all of the above example embodiments may also be described as the following supplementary notes, but are not limited to the following.

(Supplementary Note A1)

An optimization apparatus including:

a setting unit that sets a predetermined non-linear objective function;

a policy determination unit that determines a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;

a policy execution unit that acquires a reward as an execution result of the determined policy;

an update rate determination unit that determines an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and

an update unit that updates the non-linear objective function, based on the update rate.

(Supplementary Note A2)

The optimization apparatus according to supplementary note A1, wherein the policy determination unit determines the policy by further using a predetermined exploration parameter.

(Supplementary Note A3)

The optimization apparatus according to supplementary note A1, wherein the policy determination unit

calculates the exploration parameter, based on the number of trials of updating in the non-linear objective function, and

determines the policy by further using the calculated exploration parameter.

(Supplementary Note A4)

The optimization apparatus according to supplementary note A3, wherein the policy determination unit

further calculates the exploration parameter, based on a distance from a boundary of a feasible region.

(Supplementary Note A5)

The optimization apparatus according to any one of supplementary notes A1 to A4, wherein the policy determination unit

calculates a mean value and a covariance matrix of a plurality of samples that are generated based on an estimated value of the non-linear objective function, and

determines the policy by further using the mean value and the covariance matrix.

(Supplementary Note B1)

An optimization method including:

setting a predetermined non-linear objective function;

determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;

acquiring a reward as an execution result of the determined policy;

determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and

updating the non-linear objective function, based on the update rate.

(Supplementary Note C1)

A non-transitory computer readable medium having stored therein an optimization program for causing a computer to execute:

setting processing of setting a predetermined non-linear objective function;

policy determination processing of determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function;

policy execution processing of acquiring a reward as an execution result of the determined policy;

update rate determination processing of determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and

update processing of updating the non-linear objective function, based on the update rate.

The present invention has been described above with reference to example embodiments (and examples), but the present invention is not limited to the above example embodiments (and examples). Various modifications can be made to the structure and details of the present invention which can be understood by a person skilled in the art within the scope of the present invention.

REFERENCE SIGNS LIST

-   100 Optimization apparatus
-   110 Setting unit
-   120 Policy determination unit
-   130 Policy execution unit
-   140 Update rate determination unit
-   150 Update unit
-   200 Optimization apparatus
-   210 Storage unit
-   211 Exploration parameter
-   212 Optimization program
-   220 Memory
-   230 IF unit
-   240 Control unit
-   241 Setting unit
-   242 Policy determination unit
-   243 Policy execution unit
-   244 Update rate determination unit
-   245 Update unit

What is claimed is:
 1. An optimization apparatus comprising: at least one memory configured to store instructions; and at least one processor configured to execute the instructions to: set a predetermined non-linear objective function; determine a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function; acquire a reward as an execution result of the determined policy; determine an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and update the non-linear objective function, based on the update rate.
 2. The optimization apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: determine the policy by further using a predetermined exploration parameter.
 3. The optimization apparatus according to claim 2, wherein the at least one processor is further configured to execute the instructions to: calculate the exploration parameter, based on the number of trials of updating in the non-linear objective function, and determine the policy by further using the calculated exploration parameter.
 4. The optimization apparatus according to claim 3, wherein the at least one processor is further configured to execute the instructions to: further calculate the exploration parameter, based on a distance from a boundary of a feasible region.
 5. The optimization apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: calculate a mean value and a covariance matrix of a plurality of samples that are generated based on an estimated value of the non-linear objective function, and determine the policy by further using the mean value and the covariance matrix.
 6. An optimization method comprising, by a computer: setting a predetermined non-linear objective function; determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function; acquiring a reward as an execution result of the determined policy; determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and updating the non-linear objective function, based on the update rate.
 7. A non-transitory computer-readable medium storing an optimization program causing a computer to execute: setting processing of setting a predetermined non-linear objective function; policy determination processing of determining a policy to be executed in online optimization in a bandit problem, based on the non-linear objective function; policy execution processing of acquiring a reward as an execution result of the determined policy; update rate determination processing of determining an update rate of the non-linear objective function by a multiplicative weight update method, based on the acquired reward and the non-linear objective function; and update processing of updating the non-linear objective function, based on the update rate. 